A dataset can have multiple problems, and before putting the data into a machine learning model or doing any statistical analysis, the data needs to be cleaned first. It can have issues such as the ones below.
You are already familiar with the Titanic dataset, but we have added a few more scenarios to it for the purpose of this tutorial.
1) Inappropriate column names
Column names should be consistent across the dataset. Sometimes there are spaces inside the column names, or at the start or end, and these are hard to spot with the naked eye. A good approach is to call ".columns" on the data frame; it outputs the column names with quotes, which makes stray spaces visible.
Code:
import pandas as pd

df = pd.read_csv("data_set.csv")
print(df.columns)
Output:
Index(['Passenger Id', 'Survived ', 'Pclass', 'Name', 'Gender', 'Age', 'Fare', 'Embarked'], dtype='object')
As you can clearly see, there is a space inside the 'Passenger Id' column name and a trailing space at the end of the 'Survived' column name.
How to Fix?
We can rename the columns to whatever we prefer.
df.rename(columns = {"Passenger Id": "Passenger_Id",
                     "Survived ": "Survived"},
          inplace = True)
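If many columns are affected, renaming each one by hand gets tedious. As a minimal sketch (assuming the only problems are stray spaces), you can clean every column name in one pass:
Code:
# Strip leading/trailing whitespace from all column names at once
df.columns = df.columns.str.strip()
# Optionally, replace any remaining internal spaces with underscores
df.columns = df.columns.str.replace(' ', '_')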
2) Inappropriate column data types
A user name column cannot have an integer or float data type; similarly, an age or salary column cannot have text as its data type.
Whenever we see such scenarios, the issue needs to be fixed.
Let us check the data types of the given data set:
Code:
df.dtypes
Output:
Passenger_Id int64
Survived object
Pclass int64
Name object
Gender object
Age float64
Fare int64
Embarked object
All the data types are correct except for the Survived column. Since it contains 0 (not survived) or 1 (survived), its data type must be int or float. Let us look at the data in this column.
Code:
df['Survived'].value_counts(dropna = False)
Output:
0 12
1 11
male 1
As we can see, besides 0 and 1, an erroneous value, 'male', is also present.
How to Fix?
- Either you can drop the offending record from the data set itself:
df = df[df['Survived'] != 'male']
- Or you can set the erroneous values to Null and impute the Null values later:
df['Survived'] = pd.to_numeric(df['Survived'], errors='coerce')
After the fix, the erroneous value has been replaced with NaN and the column type has changed from text to float.
The errors argument accepts three options; you can try each of them and observe their behaviour:
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
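Here is a small sketch of all three modes on a toy Series (note that 'ignore' is deprecated in recent pandas versions):
Code:
import pandas as pd

s = pd.Series(['0', '1', 'male'])

# 'coerce': unparseable values become NaN and the dtype becomes float
print(pd.to_numeric(s, errors='coerce'))   # 0.0, 1.0, NaN

# 'ignore': the Series is returned unchanged if parsing fails
print(pd.to_numeric(s, errors='ignore'))   # '0', '1', 'male'

# 'raise' (the default): a ValueError is raised on 'male'
try:
    pd.to_numeric(s, errors='raise')
except ValueError as err:
    print('raise:', err)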
3) Missing values in Data
Columns can have some of their values missing.
We can drop those particular records from the data set itself, but this may lead to information loss. Alternatively, we can impute the missing data values.
Imputation (filling out the missing values) can be performed by:
mean_value = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_value)
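Mean imputation is only one option. As a sketch (reusing the columns from the dataset above), the median is more robust for skewed numeric columns, and the most frequent value (mode) works for categorical columns:
Code:
# Median is more robust than the mean when the column is skewed
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# For a categorical column, fill with the most frequent value
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])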
4) Outliers in Data
Outliers are extreme values whose chance of occurrence is low. A data set can have outliers, and while performing any analysis it is necessary to handle them, because outliers can deeply affect your results.
Examples:
- An old person's age usually lies between 70 and 90
- An employee's salary usually ranges from 4 to 10 Lacs per annum
In all such cases, data values far outside these ranges are so extreme that their chance of occurrence is very low; the usual/general data points are quite different from them.
There are multiple ways to handle the outliers:
1) The easiest way is to drop the records where you find the outlier. But this is the worst treatment, as we lose information.
2) We can cap the outlier values at some threshold (see the sketch after this list).
Set all ages to 80 if they cross 80 years
Set all salaries to 15 if they cross 15 Lacs
3) We can use transformations. A transformation like log changes the scale of the data, so an outlier no longer remains an outlier.
4) We can assign outliers to some other category.
For example, we can set the value of all the rare data points to -1
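Here is a minimal sketch of options 2) and 3), assuming the 'Age' and 'Fare' columns from the dataset above; the thresholds are chosen purely for illustration:
Code:
import numpy as np

# Option 2: cap (clip) values at a threshold
df['Age'] = df['Age'].clip(upper=80)

# Option 3: log-transform to compress the scale
# (log1p handles zeros; assumes Fare is non-negative)
df['Fare_log'] = np.log1p(df['Fare'])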
5) Duplicate Data
When the data set contains duplicate records, it affects the results. Suppose one particular record is an outlier and multiple duplicates of it are present in the dataset; they will adversely affect your analysis. So it is recommended to drop duplicates before performing your analysis.
To drop duplicate records:
df.drop_duplicates(subset = ['customer_id'], inplace = True)
This will make sure that the data set contains only a single entry per customer.
df.drop_duplicates(inplace = True)
This will make sure that every row is distinct from the others.
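Before dropping anything, it can help to count how many duplicates exist and to control which copy survives. A small sketch (keep is a standard drop_duplicates parameter; 'customer_id' is the same example column as above):
Code:
# Count fully duplicated rows
print(df.duplicated().sum())

# Keep the last occurrence of each customer instead of the default first
df.drop_duplicates(subset = ['customer_id'], keep = 'last', inplace = True)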